ABSTRACT

Top-k query processing finds a list of k results that have the largest
scores w.r.t. the user-given query, under the assumption that all the
k results are independent of each other. In practice, some of the
top-k results returned can be very similar to each other, so some of
them are redundant. In the literature, diversified top-k search has
been studied to return k results that take both score and diversity
into consideration. Most existing solutions for diversified top-k
search assume that the scores of all the search results are given, and
some works solve the diversity problem in a specific setting and can
hardly be extended to general cases. In this paper, we study the
diversified top-k search problem. We define a general diversified
top-k search problem that only considers the similarity of the search
results themselves. We propose a framework such that most existing
solutions for top-k query processing can be extended easily to handle
diversified top-k search, by simply applying three new functions: a
sufficient stop condition sufficient(), a necessary stop condition
necessary(), and an algorithm for diversified top-k search on the
current set of generated results, div-search-current(). We propose
three new algorithms, namely div-astar, div-dp, and div-cut, to solve
the div-search-current() problem. div-astar is an A* based algorithm.
div-dp decomposes the results into components that are searched
independently using div-astar and combined using dynamic programming.
div-cut further decomposes the current set of generated results using
cut points and combines the results using sophisticated operations. We
conducted extensive performance studies using two real datasets,
enwiki and reuters. Our div-cut algorithm finds the optimal solution
for the diversified top-k search problem in seconds even for k as
large as 2,000.

1. INTRODUCTION 

Top-k queries are among the most fundamental queries used in the IR
and database areas. Given a user query, the top-k results of the query
are a list of k results that have the largest scores/relevances with
respect to the user query, under the assumption that all of the k
results are independent of each other. In some situations, for a
certain top-k query, some of the results returned can be very similar
to each other. For example, if we search "apple" in Google image
search, 7 out of the top-10 results returned are the logo of the Apple
company. In order to remove the redundancy in the results, and at the
same time keep the quality of the top-k results, diversity should be
considered in top-k search problems.

Most top-k search algorithms in the literature aim at finding an early
stop condition, such that they can find the top-k results without
exploring all the possible search results. Based on this, two
frameworks are generally used, namely, the incremental top-k framework
and the bounding top-k framework. The incremental top-k framework
outputs the results one by one in non-increasing order of their
scores, and stops as soon as k results are generated. It aims to find
a polynomial delay algorithm such that, given the results generated so
far, the next result with the largest score can be generated in time
polynomial w.r.t. the size of the input only [16, 15, 20, 14]. In the
bounding top-k framework, results are not necessarily generated in
non-increasing order of their scores. It maintains a score upper bound
for the unseen results, which is updated every time a new result is
generated. The algorithm stops when the current k-th largest score is
no smaller than the upper bound for the unseen results. The threshold
algorithm based approaches [7, 9] fall into this framework, and other
approaches include [12, 17].

Diversity-aware search has been studied in recent years. Most of the
existing solutions that support diversity on top-k search results
assume that the ranking of all the search results is given in advance.
Based on this ranking, a diversity search algorithm outputs k results
using a scoring function that takes both query relevance and diversity
into consideration [6, 1, 11, 5, 2]. Other works give algorithms that
solve the diversity problem for a specific area, e.g., graph search
[18] or document search [22], and can hardly be extended to support
general top-k diversity search.

In this paper, we propose a general framework to handle the
diversified top-k search problem. We keep the advantage of the
existing top-k search algorithms, namely that they can stop early
without exploring all search results, and at the same time we take
diversity into consideration. We show that any top-k search algorithm
that can be used in the incremental top-k framework or the bounding
top-k framework can be easily extended to handle diversified top-k
search, by adding three new functions studied in this paper: a
sufficient stop condition sufficient(), a necessary stop condition
necessary(), and a diversity search function div-search-current(). All
of them are application independent. The only assumption in our
framework is that, given any two search results $v_i$ and $v_j$,
whether $v_i$ and $v_j$ are similar to each other can be decided,
e.g., by testing $sim(v_i, v_j) > \tau$ with a similarity function
sim() and a user-given threshold $\tau$. We output a list of k results
with maximum total score such that no two of them are similar to each
other. We make the following contributions in this paper.

(1) We formalize the diversified top-k search problem. Based on our
definition, the optimal solution only depends on the similarity of the
search results themselves, and no other information is needed.

(2) We study two categories of algorithms generally used in the
literature for finding top-k results with early stop, namely, the
incremental top-k framework and the bounding top-k framework. We show
that both frameworks can be extended to diversified top-k search by
simply adding three application independent functions studied in this
paper, namely, a sufficient stop condition sufficient(), a necessary
stop condition necessary(), and a diversity search function
div-search-current(). The sufficient stop condition allows the search
to stop early, and the necessary stop condition reduces the number of
div-search-current() invocations, since div-search-current() is
usually a costly operation.

(3) We show that div-search-current() is an NP-Hard problem and is
hard to approximate. We propose three new algorithms, namely,
div-astar, div-dp, and div-cut, to find the optimal solution for
div-search-current(). div-astar is an A* based algorithm and is slow
when handling a large number of results. div-dp decomposes the results
into disconnected components in order to reduce the size of the graph
to be searched using div-astar. Results in div-dp are combined using
dynamic programming. div-cut further decomposes each component into
several subgraphs to form a cptree, based on the cut points of each
component. A tree based search is applied on the cptree to find the
optimal solution.

(4) We conducted extensive performance studies using two real datasets
to test the performance of the three algorithms. Our div-cut approach
can find the diversified top-k results within seconds when k is as
large as 2,000.

The rest of this paper is organized as follows. Section 2 formally
defines the diversified top-k search problem. Section 3 shows the two
existing frameworks for general top-k search problems. Section 4 shows
how to extend the two categories of top-k search approaches to solve
diversified top-k search, by defining a sufficient stop condition
sufficient(), a necessary stop condition necessary(), and a
diversified top-k search algorithm div-search-current() that searches
the current result set. Sections 5, 6, and 7 give three algorithms to
solve the div-search-current() problem. We show our experimental
results in Section 8, and introduce the related work in Section 9.
Finally, we conclude our paper in Section 10.

2. PROBLEM DEFINITION 

We consider a list of results $S = \{v_1, v_2, \cdots\}$. For each
$v_i \in S$, the score of $v_i$ is denoted as $score(v_i)$. For any
two results $v_i \in S$ and $v_j \in S$, there is a user-defined
similarity function $sim(v_i, v_j)$ denoting the similarity between
the two results $v_i$ and $v_j$. Without loss of generality, we assume
$0 \le sim(v_i, v_j) \le 1$ for any two results $v_i \in S$ and
$v_j \in S$, and $sim(v, v) = 1$ for any $v \in S$. Given an integer k
where $1 \le k \le |S|$, the top-k results of S is a list of k results
$S_k$ that satisfies the following two conditions.

1) $S_k \subseteq S$ and $|S_k| = k$.

2) For any $v_i \in S_k$ and $v_j \in S - S_k$,
$score(v_i) \ge score(v_j)$. Here, $S - S_k$ is the set of results
that are in S but not in $S_k$, i.e.,
$S - S_k = \{v \mid v \in S, v \notin S_k\}$.

Given two results $v_i \in S$ and $v_j \in S$, $v_i$ is similar to
$v_j$ iff $sim(v_i, v_j) > \tau$, where $\tau$ is a user-defined
threshold and $0 < \tau \le 1$. We use $v_i \approx v_j$ to denote
that $v_i$ is similar to $v_j$.

Definition 1 (Diversified Top-k Results) Given a list of search
results $S = \{v_1, v_2, \cdots\}$, and an integer k where
$1 \le k \le |S|$, the diversified top-k results of S, denoted as
D(S), is a list of results that satisfies the following three
conditions.

1) $D(S) \subseteq S$ and $|D(S)| \le k$.

2) For any two results $v_i \in S$ and $v_j \in S$ with
$v_i \ne v_j$, if $v_i \approx v_j$, then
$\{v_i, v_j\} \not\subseteq D(S)$.

3) $\sum_{v \in D(S)} score(v)$ is maximized.

Figure 1: A sample diversity graph

Intuitively, D(S) is the set of at most k results, such that no two
results are similar to each other, and the total score of the results
is maximized. We use $score(D(S))$ to denote the total score of the
results in D(S), i.e., $score(D(S)) = \sum_{v \in D(S)} score(v)$.
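To make Definition 1 concrete, the following is a minimal brute-force sketch in Python. The names `diversified_top_k`, `scores`, and `similar`, as well as the toy data, are illustrative assumptions and do not come from the paper. The function enumerates every subset of at most k results, so it is only usable on very small inputs.

```python
from itertools import combinations

def diversified_top_k(scores, similar, k):
    """Brute-force D(S): the best-scoring set of at most k pairwise non-similar results.

    scores:  dict mapping a result id to its score
    similar: set of frozenset({a, b}) pairs that count as similar
    """
    best_set, best_score = (), 0
    ids = sorted(scores)
    for size in range(1, k + 1):                 # condition 1: |D(S)| <= k
        for cand in combinations(ids, size):
            # Condition 2: no two selected results may be similar.
            if any(frozenset(p) in similar for p in combinations(cand, 2)):
                continue
            total = sum(scores[v] for v in cand)
            if total > best_score:               # condition 3: maximize the total score
                best_set, best_score = cand, total
    return set(best_set), best_score

# Hypothetical toy data (this is not Figure 1 of the paper).
scores = {"a": 10, "b": 9, "c": 8, "d": 3}
similar = {frozenset({"a", "b"}), frozenset({"a", "c"})}
```

With this toy data the plain top-2 results would be a and b, which are similar to each other, whereas `diversified_top_k(scores, similar, 2)` selects b and c.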

In this paper, we aim to find the diversified top-k results. Our goal
is a general approach such that any existing algorithm that returns
the top-k results of a certain problem can be easily changed to return
the diversified top-k results by applying our framework, in which the
result set S need not be computed in advance but grows incrementally
with an early stop condition. We first give the definition of the
diversity graph.

Definition 2 (Diversity Graph) Given a list of results
$S = \{v_1, v_2, \cdots\}$, the diversity graph of S, denoted as
G(S) = (V, E), is an undirected graph such that for any result
$v \in S$, there is a corresponding node $v \in V$, and for any two
results $v_i \in S$ and $v_j \in S$, there is an edge
$(v_i, v_j) \in E$ iff $v_i \approx v_j$. We use V(G(S)) and E(G(S))
to denote the set of nodes and the set of edges in the diversity graph
G(S) respectively, and use $v.adj(G(S))$ to denote the set of nodes
that are adjacent to v in G(S). If the context is obvious, we use
$v_i$ to denote both the result $v_i$ in S and the node $v_i$ in G(S),
we use G to denote G(S), and we use D to denote D(S). Without loss of
generality, we assume nodes in G(S) are arranged in non-increasing
order of their scores, i.e., for any $1 \le i < j \le |V(G(S))|$,
$score(v_i) \ge score(v_j)$.
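As an illustration of Definition 2, the sketch below builds a diversity graph from a score function, a similarity function, and a threshold. The function name `build_diversity_graph`, the Jaccard similarity, and the three toy documents are assumptions made for this example only. The paper leaves sim() to the application.

```python
def build_diversity_graph(results, score, sim, tau):
    """Return (order, adj): nodes in non-increasing score order and their adjacency sets."""
    order = sorted(results, key=score, reverse=True)
    adj = {v: set() for v in order}
    for i, u in enumerate(order):
        for v in order[i + 1:]:
            if sim(u, v) > tau:          # u and v are similar, so they share an edge
                adj[u].add(v)
                adj[v].add(u)
    return order, adj

# Hypothetical documents: id -> (set of terms, score).
docs = {
    "d1": ({"apple", "logo", "company"}, 9.0),
    "d2": ({"apple", "logo", "brand"}, 8.0),
    "d3": ({"apple", "fruit", "tree"}, 7.0),
}

def jaccard(u, v):
    a, b = docs[u][0], docs[v][0]
    return len(a & b) / len(a | b)

order, adj = build_diversity_graph(docs, lambda v: docs[v][1], jaccard, tau=0.4)
```

Here d1 and d2 share two of four distinct terms (similarity 0.5), so they are joined by an edge, while d3 stays isolated.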

The diversified top-k results D(S) can be equivalently defined as a
subset of nodes in G(S) that satisfies the following three conditions.

1) $|D(S)| \le k$.

2) D(S) is an independent set of G(S).

3) $score(D(S))$ is maximized.

Here, an independent set of a graph is a set of nodes in the graph in
which no two nodes are adjacent.

Example 1 Fig. 1 shows the diversity graph for 6 results
$S = \{v_1, v_2, \cdots, v_6\}$. Suppose k = 2. The optimal solution
D(S) includes two points $v_1$ and $v_2$ with score 18, as shown on
the left part of Fig. 1. Suppose k = 3. The optimal solution D(S)
includes three points $v_3$, $v_4$, and $v_5$ with score 20, as shown
on the right part of Fig. 1.

In the following, we first show the two existing frameworks to solve
top-k search problems, namely, the incremental top-k framework and the
bounding top-k framework, which are most generally used in top-k
search algorithms. Then we show the framework of

Algorithm 1 incremental(k)

1: S ← ∅;
2: for i = 1 to k do
3:   v ← incremental-next();
4:   if v = ∅ then
5:     break;
6:   S ← S ∪ {v};
7: return S;



Algorithm 3 div-search(k)

1: S ← ∅; D(S) ← ∅;
2: while not sufficient() do
3:   the code to update S (and unseen);
4:   if necessary() then
5:     D(S) ← div-search-current(G(S), k);
6: return D(S);



Algorithm 2 bounding(k)

1: S ← ∅;
2: unseen ← +∞;
3: while the k-th largest score of S < unseen do
4:   v ← bounding-next();
5:   if v = ∅ then
6:     break;
7:   S ← S ∪ {v};
8:   update unseen;
9: return top-k results in S;



our approach to extend the two frameworks to handle diversified top-k
search.

3. TOP-K SEARCH FRAMEWORKS

In the literature, the framework of most algorithms that find top-k
results falls into two categories, namely, the incremental top-k
framework and the bounding top-k framework.

Incremental Top-k: In the incremental top-k framework, results are
generated one by one by calling a procedure incremental-next(), in
non-increasing order of their scores. The algorithm stops after k
results are generated, and these k results are the final top-k results
for the problem. The framework, named incremental, is shown in
Algorithm 1. Much existing work falls into this category, e.g.,
finding top-k shortest paths in graphs, and finding top-k Steiner
trees, communities, and r-cliques in graphs [16, 15, 20, 14]. Many of
these works ensure that the time complexity of each incremental-next()
call, which generates the next result with the largest score, is
polynomial w.r.t. the size of the input only.

Bounding Top-k: In the bounding top-k framework, results are generated
one by one by calling a procedure bounding-next(), but not necessarily
in non-increasing order of their scores. A bound unseen is defined to
be the upper bound of the scores of the unseen results. After each
result is generated by bounding-next(), unseen is updated to a
possibly smaller value. The algorithm stops when the k-th largest
score of all generated results is no smaller than unseen, the upper
bound for the unseen results. The framework, named bounding, is shown
in Algorithm 2. The threshold algorithm that is generally used to
return top-k results falls into this category [7, 9]. Other works that
fall into this category include [12, 17].
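The two frameworks can be sketched in Python as follows. This is only an illustration of Algorithms 1 and 2 under assumed interfaces: results are (id, score) pairs, `next_result` is a hypothetical callable that returns `None` once no result is left, and in the bounding case it also returns the updated bound unseen. The helper `make_stream` and the toy data are not from the paper.

```python
import heapq

def incremental(next_result, k):
    """Algorithm 1: next_result() returns results in non-increasing score order, or None."""
    S = []
    for _ in range(k):
        v = next_result()
        if v is None:
            break
        S.append(v)
    return S

def bounding(next_result, k):
    """Algorithm 2: next_result() returns ((id, score), unseen_bound), or None."""
    S, unseen = [], float("inf")
    # Keep going while fewer than k results exist or the k-th largest score is below unseen.
    while len(S) < k or heapq.nlargest(k, (s for _, s in S))[-1] < unseen:
        item = next_result()
        if item is None:
            break
        (v, s), unseen = item
        S.append((v, s))
    return heapq.nlargest(k, S, key=lambda r: r[1])

def make_stream(items):
    """Hypothetical helper that turns a list into a next_result() callable."""
    it = iter(items)
    return lambda: next(it, None)

sorted_stream = make_stream([("y", 9), ("z", 7), ("x", 5)])
unsorted_stream = make_stream(
    [(("x", 5), 9), (("y", 9), 7), (("z", 7), 2), (("w", 2), 0)]
)
```

In the bounding example, the search stops after the third result because the second-largest score seen (7) is no smaller than the bound on unseen scores (2), so the fourth result is never requested.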

4. DIVERSIFIED TOP-K SEARCH

In this section, we show how to extend the incremental top-k framework
incremental and the bounding top-k framework bounding to handle
diversified top-k search. We mainly focus on two tasks. First, a new
early stop condition is needed. Second, an algorithm that finds the
diversified top-k results for the currently generated result set is
needed. For the early stop condition, in the original algorithms, the
stop condition for incremental is simply $|S| = k$ and the stop
condition for bounding is that the current k-th largest score is no
smaller than unseen. Obviously, neither of them can be applied to
diversified top-k search. Consider an extreme case: when the algorithm
stops using the original stop condition, it is possible that all the
results generated are similar to each other. Then the current
diversified top-k results contain only 1 result, the one with the
largest score. This is not the optimal solution, because it is
possible that an unseen result is not similar to the current one.
Here, D(S) computed for the currently generated result set S can be
used to check the new stop condition, and if the new stop condition is
satisfied, D(S) is the optimal solution for the diversified top-k
search.

We extend both incremental and bounding using the same framework,
which is shown in Algorithm 3, by adding three new functions: a new
sufficient stop condition sufficient(), a new necessary stop condition
necessary(), and an algorithm div-search-current() that searches the
diversified top-k results on the currently generated result set. The
algorithm executes the code of the original top-k algorithm to update
S and stops when sufficient() is satisfied. For incremental, the code
is lines 3-6 in Algorithm 1, and for bounding, the code is lines 4-8
in Algorithm 2. After updating S, we construct the diversity graph
G(S) on S based on the similarity function sim() for any two given
results. If the necessary stop condition is satisfied, we find the
diversified top-k results for the current result set S using
div-search-current(). The necessary stop condition is used to reduce
the number of calls to div-search-current(), because
div-search-current() is a costly operation. In the following, we
introduce the sufficient stop condition, the necessary stop condition,
and the search algorithm for the current set.

Sufficient Stop Condition: Given the current result set S, we need to
calculate an upper bound best(S) for the possible optimal solutions,
considering both the current result set S and the unseen results. Let
$D_i(S)$ be the best diversified results of S with exactly i elements
for $1 \le i \le k$, i.e., $D_i(S)$ is a subset of nodes in V(G(S))
that satisfies the following three conditions.

1) $|D_i(S)| = i$.

2) $D_i(S)$ is an independent set of G(S).

3) $score(D_i(S))$ is maximized.

Lemma 1 Given $D_i(S)$ for $1 \le i \le k$ and the score upper bound u
of all the unseen results, the upper bound best(S) can be calculated
as follows.

$$best(S) = \max_{1 \le i \le k} \{ score(D_i(S)) + (k - i) \times u \} \qquad (1)$$

where u is the score of the last generated result v, $score(v)$, for
incremental, and is the upper bound of the unseen results, unseen, for
bounding.

Proof Sketch: Suppose the final optimal solution is O. We can divide O
into two parts, $O = O_1 \cup O_2$, where $O_1$ is the set of
generated results and $O_2$ is the set of unseen results. Suppose
$O_1$ has $n_1$ elements and $O_2$ has $n_2$ elements. We have
$n_1 + n_2 \le k$. Since $O_1$ is a set of generated results, we have
(1) $score(O_1) \le score(D_{n_1}(S))$, since $D_{n_1}(S)$ is the
optimal solution with $n_1$ elements. We also have (2)
$score(O_2) \le n_2 \times u \le (k - n_1) \times u$, since u is the
score upper bound for all unseen results. Combining (1) and (2), we
have $score(O) = score(O_1) + score(O_2) \le score(D_{n_1}(S)) +
(k - n_1) \times u \le \max_{1 \le i \le k} \{ score(D_i(S)) +
(k - i) \times u \} = best(S)$. Thus best(S) is an upper bound for the
optimal solution. □

Having the score upper bound best(S) for the optimal solution, the
sufficient stop condition for div-search can be defined as follows.

$$score(D(S)) \ge best(S) \qquad (2)$$
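As a small worked illustration of Eq. (1) and Eq. (2), assume the hypothetical values k = 3, $score(D_1(S)) = 10$, $score(D_2(S)) = 17$, and $score(D_3(S)) = 20$, so that $score(D(S)) = 20$. These numbers are chosen for illustration and are not taken from the paper.

```latex
% Case u = 5: the bound exceeds the current optimum, so the search must continue.
\begin{align*}
best(S) &= \max\{10 + 2 \times 5,\; 17 + 1 \times 5,\; 20 + 0 \times 5\}
         = \max\{20,\, 22,\, 20\} = 22 > 20 = score(D(S)).
\end{align*}
% Case u = 3: no combination with unseen results can beat D(S), so Eq. (2) holds.
\begin{align*}
best(S) &= \max\{10 + 2 \times 3,\; 17 + 1 \times 3,\; 20 + 0 \times 3\}
         = \max\{16,\, 20,\, 20\} = 20 \le 20 = score(D(S)).
\end{align*}
```

In the first case a size-2 solution could still be completed by one unseen result of score up to 5, which is why the search cannot stop yet.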

The following lemma shows that, after every iteration, div-search
moves towards the sufficient stop condition.

Lemma 2 For any $S' \subseteq S$, $score(D(S')) \le score(D(S))$ and
$best(S') \ge best(S)$.

Proof Sketch: Since $S' \subseteq S$, the best solution on S' is a
feasible solution on S, thus $score(D(S')) \le score(D(S))$. Compared
to best(S'), best(S) is calculated by changing some upper bounds u'
used when calculating best(S') into real scores no larger than u', and
by changing the other unseen upper bounds from u' to u, where
$u \le u'$ is guaranteed by the original algorithm. Thus
$best(S') \ge best(S)$. □

Necessary Stop Condition: We discuss the necessary stop condition for
div-search. The necessary stop condition is used as follows. In each
iteration, before invoking div-search-current(), if the necessary stop
condition is not satisfied, then div-search-current() need not be
invoked in this iteration.

Lemma 3 For div-search, if it can stop in a certain iteration, one of
the following conditions should be satisfied before invoking the
procedure div-search-current():

1) The last generated result $v = \emptyset$.

2) $|S| \ge |S'| + k - \max\{i \mid 1 \le i \le k, D_i(S') \ne
\emptyset\}$ and the k-th largest score in S is $\ge u$.

Here S' is the set of results when div-search-current() was last
invoked, or $\emptyset$ if div-search-current() has never been
invoked.

Proof Sketch: The first condition is trivial. Now suppose
$v \ne \emptyset$. For the second condition, when the k-th largest
score in S is smaller than u, it is possible that a new result can be
added that updates the k-th largest score, and thus improves the
current best solution. Now we discuss $|S| \ge |S'| + k - \max\{i \mid
1 \le i \le k, D_i(S') \ne \emptyset\}$. Here $\max\{i \mid 1 \le i
\le k, D_i(S') \ne \emptyset\}$ is the size of the maximum independent
set of G(S') if it is smaller than k, and $k - \max\{i \mid 1 \le i
\le k, D_i(S') \ne \emptyset\}$ is the minimum number of nodes that
need to be added in order to generate a result of size k. If such a
result does not exist, we cannot stop, because we can always add some
unseen nodes to any existing solution with a size smaller than k to
make the score larger. As a result, we should add at least
$k - \max\{i \mid 1 \le i \le k, D_i(S') \ne \emptyset\}$ nodes into
S'. □
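The pieces above can be put together in a short end-to-end sketch for the incremental case. Everything here is an illustration under stated assumptions rather than the paper's implementation: `best_by_size` is a brute-force stand-in for div-search-current(), scores are assumed non-negative, the sufficient condition is checked only right after D has been recomputed (a conservative variant of Algorithm 3), and all names and the toy data are hypothetical.

```python
from itertools import combinations

def best_by_size(S, score, similar, k):
    """Brute-force stand-in for div-search-current(): D[i] = (score, set) for exact size i."""
    D = {}
    for i in range(1, min(k, len(S)) + 1):
        for cand in combinations(S, i):
            if any(similar(a, b) for a, b in combinations(cand, 2)):
                continue
            total = sum(score[v] for v in cand)
            if i not in D or total > D[i][0]:
                D[i] = (total, set(cand))
    return D

def div_search_incremental(stream, score, similar, k):
    """Diversified top-k over a stream sorted by non-increasing score."""
    it = iter(stream)
    S, D, last_size = [], {}, 0
    while True:
        v = next(it, None)                      # None: no more results (Lemma 3, case 1)
        if v is not None:
            S.append(v)
            # Lemma 3, case 2: too few new results to reach size k, so skip the search.
            if len(S) < last_size + k - max(D, default=0):
                continue
        D, last_size = best_by_size(S, score, similar, k), len(S)
        if v is None:
            break
        u = score[v]                            # bound on every unseen score
        current = max(total for total, _ in D.values())
        best = max(total + (k - i) * u for i, (total, _) in D.items())  # Eq. (1)
        if current >= best:                     # Eq. (2): sufficient stop condition
            break
    return max(D.values(), key=lambda d: d[0], default=(0, set())), len(S)

# Hypothetical data: a is similar to b, c, and d.
score = {"a": 10, "b": 9, "c": 8, "d": 7, "e": 1}
pairs = {frozenset("ab"), frozenset("ac"), frozenset("ad")}
similar = lambda x, y: frozenset((x, y)) in pairs
result, seen = div_search_incremental(iter("abcde"), score, similar, 2)
```

On this data the plain top-2 stop condition would fire after a and b, which are similar. The diversified search instead stops after four results with the pair b, c, and never requests e.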

Searching Current Set: The most important operation in our framework
is the algorithm div-search-current(), which searches the diversified
top-k results for the current result set S. We first show the
difficulty of the problem in this section, and then give three
algorithms for div-search-current(), namely div-astar, div-dp, and
div-cut, in the next three sections respectively.

The following lemma shows that finding the diversified top-k results
is an NP-Hard problem.

Lemma 4 Finding D(S) on G(S) is an NP-Hard problem.

Proof Sketch: We consider a special case of the problem, where
$score(v) = 1$ for all $v \in V(G(S))$, and $k = |V(G(S))|$. In such a
case, finding D(S) on G(S) is equivalent to finding the maximum
independent set on graph G(S), which is an NP-Hard problem. Thus, the
original problem is an NP-Hard problem. □

Figure 2: The greedy algorithm. (a) The Greedy Solution (b) The
Optimal Solution

Figure 3: Overview of three algorithms

Greedy is Not Good: Given G(S) and k, a simple greedy algorithm to
find D(S) works as follows. It proceeds in iterations. In each
iteration, the node v with the maximum score is selected and put into
D(S). After that, v and all the nodes that are adjacent to v in G(S)
are removed from G(S). The process stops when G(S) is empty or D(S)
contains k results.

The quality of the greedy algorithm can be arbitrarily bad: its
approximation ratio is not bounded by a constant factor. Even the
special case of the problem, the maximum independent set problem, is
known in the literature to be hard to approximate. We give an example.
Fig. 2 shows a diversity graph with 201 nodes and 200 edges. Suppose
k = 100. Using the greedy algorithm, the solution is shown in
Fig. 2(a), where the selected results are marked gray. The score of
the greedy solution is 199. The optimal solution for the problem is
shown in Fig. 2(b). The score of the optimal solution is 9,900, which
is nearly 50 times the score of the greedy solution.
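The greedy procedure is easy to sketch, and the numbers quoted for Fig. 2 can be reproduced on a graph of the right shape. Since the figure itself is not reproduced here, the graph below is an assumed reconstruction that merely matches the stated figures (201 nodes, 200 edges, greedy score 199, optimal score 9,900): one hub with score 100, joined to 100 nodes with score 99, each of which has one extra neighbour with score 1. The function name and node numbering are illustrative.

```python
def greedy_diversified(scores, adj, k):
    """Greedy heuristic: repeatedly pick the highest-scoring node that is still available."""
    remaining = set(scores)
    chosen = []
    while remaining and len(chosen) < k:
        v = max(remaining, key=lambda x: scores[x])
        chosen.append(v)
        remaining -= adj[v] | {v}        # drop v and everything similar to v
    return chosen

# Assumed reconstruction of a Fig. 2-like graph: node 0 is the hub,
# nodes 1..100 score 99 each, nodes 101..200 score 1 each.
n = 100
scores, adj = {0: 100}, {0: set()}
for i in range(1, n + 1):
    mid, leaf = i, n + i
    scores[mid], scores[leaf] = 99, 1
    adj[mid], adj[leaf] = {0, leaf}, {mid}
    adj[0].add(mid)

greedy_score = sum(scores[v] for v in greedy_diversified(scores, adj, 100))
mids = set(range(1, n + 1))              # the 100 nodes with score 99
mids_score = sum(scores[v] for v in mids)
```

Greedy takes the hub first, which eliminates all 100 nodes of score 99, and then fills the remaining 99 slots with nodes of score 1. The 100 nodes of score 99 are pairwise non-adjacent and together reach the optimal score quoted in the text.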

In the following, we aim to find the optimal solution D(S). We propose
three algorithms, namely, div-astar, div-dp, and div-cut. div-astar
searches the whole space S using A* based heuristics, by designing an
upper bound function astar-bound(). Because the problem is NP-Hard,
div-astar can hardly handle problems with a large diversity graph G.
In our second algorithm, div-dp, we decompose G into connected
components. Each component can be much smaller than the original graph
G, and is searched independently using div-astar. We combine the
components using an efficient operation $\oplus$ based on dynamic
programming. In our third algorithm, div-cut, we further decompose
each connected component into subgraphs, where subgraphs are connected
through a set of cut points. Each subgraph is searched independently
at most 4 times under different conditions. We combine the components
using two efficient operations $\oplus$ and $\otimes$. The general
ideas of the three algorithms are illustrated in Fig. 3.

5. AN A* BASED APPROACH 

As discussed in Section 4, div-search-current(G(S), k) should return
the optimal solution $D_i(S)$ for every $1 \le i \le k$ in order to
evaluate the early stop condition. For simplicity, we use D to denote
the set of solutions, and we use $D.solution_i$ to denote the optimal
solution

Algorithm 4 div-astar(G, k)

Input: The diversity graph G, the top-k value k.
Output: Search result D.

 1: H ← ∅; D ← ∅;
 2: H.push((∅, 0, 0, 0));
 3: for k' = k down to 1 do
 4:   astar-search(G, H, D, k');
 5:   for all e ∈ H do
 6:     e.bound ← astar-bound(G, e, k' − 1);
 7:     update e in H;
 8: return D;

 9: procedure astar-search(G, H, D, k')
10: while H ≠ ∅ and H.top.bound > max_{i ≤ k'} {D.score_i} do
11:   e ← H.pop();
12:   for i = e.pos + 1 to |V(G)| do
13:     if v_i.adj(G) ∩ e.solution = ∅ then
14:       e' ← (e.solution ∪ {v_i}, i, e.score + score(v_i), 0);
15:       e'.bound ← astar-bound(G, e', k');
16:       H.push(e');
17:       update D using e'.solution;

18: procedure astar-bound(G, e, k')
19: p ← |e.solution|; i ← e.pos + 1;
20: bound ← e.score;
21: while p < k' and i ≤ |V(G)| do
22:   if v_i.adj(G) ∩ e.solution = ∅ then
23:     bound ← bound + score(v_i);
24:     p ← p + 1;
25:   i ← i + 1;
26: return bound;



with i results, $D_i(S)$, and use $D.score_i$ to denote the score of
the optimal solution, $score(D_i(S))$.

Our first algorithm is an A* based algorithm. The algorithm is shown
in Algorithm 4. We define a max heap H to store the entries in the A*
search. Each entry $e \in H$ is of the form e = (solution, pos, score,
bound). Entries are ranked in H according to e.bound, which is the
estimated upper bound of the solution if we further expand it in the
A* search. e.solution is the partial solution searched so far, and
e.pos is the position of the last searched node in e.solution. e.score
is the score of the partial solution, i.e., e.score =
score(e.solution). The algorithm should return $D.solution_i$ for all
$1 \le i \le k$. Suppose we have an A* algorithm that finds the
optimal solution for a certain $D.solution_i$. This algorithm would
have to be invoked k times to find the k solutions, which is costly.
We show that after searching $D.solution_i$ for a certain i, the
partial solutions in H can be reused when searching $D.solution_j$ for
j < i. In the following, we first discuss the estimated upper bound
for partial solutions. Then we discuss the A* algorithm to find the
optimal solution $D.solution_i$ for a certain i. Finally, we discuss
how the partial solutions in H can be reused to find the optimal
solutions $D.solution_i$ for all $1 \le i \le k$.

Upper Bound Estimation: Given a partial solution e, for a certain k',
we show how to estimate the score upper bound if we expand the partial
solution to a solution of at most k' elements. The algorithm
astar-bound is shown in Algorithm 4, lines 18-26. The newly added
nodes should at least satisfy the following two conditions: 1) they
cannot be nodes of e.solution, and 2) they are not adjacent to any
node in e.solution. Under these conditions, we simply add the nodes
with the largest scores, such that after adding them the total number
of nodes is no larger than k'. In order to satisfy condition 1), we
visit nodes in G from position e.pos + 1 (line 19). Since nodes in G
are sorted in non-increasing order of their scores, we add nodes one
by one until the size p reaches k'. For each node $v_i$ added,
condition 2) can be checked using
$v_i.adj(G) \cap e.solution = \emptyset$ (line 22).

Lemma 5 astar-bound(G, e, k') finds the score upper bound for the
partial solution e.solution to be expanded to a solution of at most k'
elements.

Proof Sketch: Suppose we have removed all the nodes from G that are
adjacent to at least one node in e.solution. Then the function
astar-bound(G, e, k') calculates the upper bound by expanding
e.solution using the nodes with the largest scores among the nodes
after position e.pos in G. The optimal solution that e.solution can be
expanded to also selects the expanded nodes from the set of nodes
after position e.pos, but it may not select all of those with the
largest scores, since some of them may be adjacent to each other. Thus
the score of the optimal solution cannot be larger than
astar-bound(G, e, k'). As a result, astar-bound(G, e, k') is a score
upper bound for all expansions of e.solution. □
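The bound of Lemma 5 can be sketched in a few lines of Python. The function signature is an assumption (0-based node indices, a list of scores in non-increasing order, and a list of adjacency sets), and non-negative scores are assumed. The sample data uses the scores that can be read off Example 2 of this section and only the edges incident to $v_1$ that the example mentions. The other edges of Fig. 1 are omitted because they do not affect the bound of $\{v_1\}$.

```python
def astar_bound(scores, adj, solution, pos, k_prime):
    """Upper bound on the score of any extension of `solution` to at most k_prime nodes.

    scores:   node scores in non-increasing order (node i has scores[i])
    adj:      adj[i] is the set of neighbours of node i
    solution: set of node indices already chosen
    pos:      index of the last node added to `solution` (-1 for the empty solution)
    """
    bound = sum(scores[v] for v in solution)
    p, i = len(solution), pos + 1
    while p < k_prime and i < len(scores):
        if not (adj[i] & solution):      # node i is not adjacent to any chosen node
            bound += scores[i]
            p += 1
        i += 1
    return bound

# Scores of v1..v6 as implied by Example 2; index 0 stands for v1.
scores = [10, 8, 7, 7, 6, 1]
# Only the edges incident to v1 (v1-v3, v1-v4, v1-v5) are listed here.
adj = [{2, 3, 4}, set(), {0}, {0}, {0}, set()]
```

For the partial solution $\{v_1\}$, the call `astar_bound(scores, adj, {0}, 0, 3)` skips $v_3$, $v_4$, and $v_5$ and adds $v_2$ and $v_6$, giving 10 + 8 + 1 = 19, the same value as in Example 2. With `k_prime=2` only $v_2$ is added and the bound is 18.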

A* Search for a Certain k: To find the optimal solution for a certain
k = k', the A* search algorithm astar-search is shown in Algorithm 4,
lines 9-17. It runs in iterations. In each iteration, the partial
solution e with the largest estimated upper bound is popped from H
(line 11). e can then be expanded to new partial solutions by adding a
new node into e.solution. The nodes are added from position e.pos + 1
in G, since all nodes before that position have been processed
(line 12). The newly added node $v_i$ should not be adjacent to any
node of e.solution (line 13). After adding the new node, the upper
bound of the new partial solution is computed using astar-bound(), and
the new partial solution is pushed into H for further expansion
(lines 14-16). In line 17, suppose the new partial solution e' has j
elements. The new partial solution is considered as a solution of size
j, and is used to update the current best solution $D.solution_j$ if
$D.score_j < e'.score$. The iteration stops if either H is empty or
the largest upper bound in H is no larger than the current best score
$\max_{i \le k'} \{D.score_i\}$ (line 10).

Reusing Partial Solutions: In Algorithm 4, lines 1-8 show how to share
the same H to compute $D.solution_{k'}$ for $1 \le k' \le k$, without
constructing H from scratch each time k' changes. The algorithm
processes k' in decreasing order (line 3). After processing k', the
partial solutions in H can be reused when processing k' − 1, in order
to save computational cost. If we simply keep the current entries in
H, they cannot be used directly to process k' − 1. This is because the
upper bound of each partial solution was calculated by expanding to a
solution of size k', which is not the upper bound for a solution of
size k' − 1. In order to reuse the partial solutions in H, we need to
recalculate the upper bounds of all partial solutions in H using
k' − 1, and update the positions of the entries in H accordingly
(lines 5-7). The following lemma shows the correctness of the
approach.

Lemma 6 The partial solutions in H for calculating $D.solution_i$ can
be reused when calculating $D.solution_{i-1}$.

Proof Sketch: Consider the possibly expanded node em T-L such 
that e. solution = 7J.solution;_i. There is a unique path from the 
root of "H to e. (1) Suppose e is not removed from H currently, 
then there exists a unique ancestor of e in the current %. Since 
the upper bounds have been updated and e is the optimal solution 
D.solution;_i, e can be expanded when calculating D.solution;_i. 
(2) Suppose e has been removed from H currently, then e has 
been used to update 7J.solutioni_i after removal. Since the upper 
bounds for all entries in % have been updated and e is the optimal 
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Figure 4: Finding diversified top-3 results 




Algorithm 5 operator ffi(D', D") 



Input: Search results D' and D" 
Output: Search result D = D' 6 



for subgraphs. 
) D". 



0: 



D • 

for i = 1 to k do 

D. score; 4- 0; D. solution; <— 0; 
for j = to i do 

if D'. solution., ^ or j = then 
if D".solutiorii_j ^ ori = j then 

if D'.scorej + D".score;_j > D. scores then 
D. score; <— D'.scorej + D".score;__,; 
D. solution; +— D'.solutiorij (J D".solutionj_j; 

return D; 




Algorithm 6 operator ®(D', D") 



|{i'i.i.'2}.18.18 | {ul 11. 11| 

Figure 5: Finding diversified top-2 results 

solution D.solution_{i-1}, the process can stop directly before popping
any entry from H. From (1) and (2), we conclude that after
reusing the partial solutions, D.solution_{i-1} can be calculated. □



Input: Search results D' and D" for subgraphs.
Output: Search result D = D' ⊗ D".

1: D ← ∅;
2: for i = 1 to k do
3:     D.score_i ← 0; D.solution_i ← ∅;
4:     if D'.score_i > D".score_i then
5:         D.score_i ← D'.score_i;
6:         D.solution_i ← D'.solution_i;
7:     else
8:         D.score_i ← D".score_i;
9:         D.solution_i ← D".solution_i;
10: return D;



Example 2 Consider the diversity graph shown in Fig. 1. Suppose
k = 3. The process of the div-astar search is shown in Fig. 4.
First, entry (∅, 0, ∞) is popped from the heap H, and 6 new entries
are pushed. The result {v1} has upper bound 19 because, after
selecting v1, the nodes v3, v5 and v4, which are adjacent to v1, must
be excluded when calculating the upper bound of {v1}. The only
nodes left are v2 and v6, with scores 8 and 1 respectively. Thus the
upper bound for {v1} is score({v1}) + 8 + 1 = 19. The entry with
the highest upper bound is ({v3}, 7, 20), and thus it is popped from H
in the next iteration and generates two new entries. Accordingly,
({v3, v4}, 14, 20) and ({v3, v4, v5}, 20, 20) are popped from H
in order. At this moment, D.score_1 = 10, D.score_2 = 14, and
D.score_3 = 20. The stop condition is satisfied, and D.solution_3
is the optimal diversified top-3 result. Now consider computing
D.solution_2, since it is not yet the optimal solution. We do not
need to reconstruct H from scratch. We update the upper bounds
of all entries that exist in the current H. In this example, entry
({v1}, 10, 19) is updated to ({v1}, 10, 18) as shown in Fig. 5.
Continuing the iteration, ({v1}, 10, 18) and ({v1, v2}, 18, 18) are
popped from H in order. At this moment, D.score_2 is updated to
18 and the stop condition is satisfied. Thus 18 is the best score for
k = 2.
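
The upper-bound update in Example 2 can be sketched in Python. This is a simplified reading, not necessarily the exact bound used by div-astar: it assumes the bound of a partial solution is its own score plus the largest scores among the nodes that are not adjacent to it, and it ignores edges among those remaining nodes. The name `upper_bound` and the dictionary layout are illustrative. Only the adjacency of v1 and the scores quoted in the example are used; the other edges of Fig. 1 are not reproduced here.

```python
import heapq

def upper_bound(partial, adj, score, k):
    """Score of `partial` plus the best (k - |partial|) scores still selectable."""
    blocked = set(partial)
    for v in partial:
        blocked |= adj.get(v, set())      # neighbours can no longer be chosen
    remaining = [s for v, s in score.items() if v not in blocked]
    slots = max(k - len(partial), 0)
    return sum(score[v] for v in partial) + sum(heapq.nlargest(slots, remaining))

# Scores as read from the entries in Example 2 (assumed layout).
# Only the edges of v1 are stated in the text, so only those are listed.
score = {"v1": 10, "v2": 8, "v3": 7, "v4": 7, "v5": 6, "v6": 1}
adj = {"v1": {"v3", "v4", "v5"}}

print(upper_bound({"v1"}, adj, score, 3))  # bound while searching for k = 3
print(upper_bound({"v1"}, adj, score, 2))  # recomputed bound for k = 2
```

The two calls give 19 and 18, matching the entries ({v1}, 10, 19) and ({v1}, 10, 18) in the example. Recomputing the bound with the smaller k is the only change needed to reuse an entry.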



6. A DP BASED APPROACH 

The div-astar algorithm is not suitable for handling a large diversity
graph G, since the search space of div-astar increases exponentially
with respect to the size of G and k. In order to reduce the
size of the diversity graph used for the div-astar search, in this
section we decompose G into a set of disconnected components. We
show that we only need to process each disconnected component
separately using div-astar, and that the solutions for the disconnected
components can be combined efficiently into the solution for the whole
graph G using dynamic programming. Before introducing the
algorithm, we first introduce two operators, ⊕ and ⊗.



The ⊕ Operator: The ⊕ operator has two operands, search result
D' and search result D". For 1 ≤ i ≤ k, D.solution_i is the solution
of size i with the largest score obtained by combining some nodes in D'
with other nodes in D". The algorithm to compute D = D' ⊕ D"
is shown in Algorithm 5 using dynamic programming. It calculates
D.solution_i one by one for 1 ≤ i ≤ k (line 2). For a certain i,
we try to select j nodes from D' and the remaining i - j nodes from
D" for all 0 ≤ j ≤ i (line 4). For a certain j, a feasible solution
can be generated from D' and D" if two conditions are satisfied:
1) D'.solution_j ≠ ∅ or j = 0, and 2) D".solution_{i-j} ≠ ∅
or i - j = 0 (lines 5-6). D.solution_i is the one that results in the
largest total score (lines 7-9). The time complexity of Algorithm
5 is O(k^2). The ⊕ operator is suitable for two search
results that are generated from two disjoint subgraphs.
The ⊕ operator has the following two properties.
(Commutative law) D ⊕ D' = D' ⊕ D.
(Associative law) (D ⊕ D') ⊕ D" = D ⊕ (D' ⊕ D").
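
As an illustration, the ⊕ operator can be sketched in Python as follows. The representation is an assumption made for this sketch: a search result is a list indexed by solution size 0..k whose entries are (score, node set) pairs, with `None` standing for "no solution of this size". The names `Result`, `empty_result` and `combine_disjoint` are illustrative. Unlike Algorithm 5, which initialises each score to 0, the sketch keeps `None` for sizes that cannot be formed.

```python
from typing import FrozenSet, List, Optional, Tuple

# Entry i holds (score, nodes) for the best solution of size i, or None.
Result = List[Optional[Tuple[float, FrozenSet[str]]]]

def empty_result(k: int) -> Result:
    """Only the size-0 solution (empty set, score 0) exists."""
    return [(0.0, frozenset())] + [None] * k

def combine_disjoint(d1: Result, d2: Result, k: int) -> Result:
    """Sketch of D = D' ⊕ D'' for results over two disjoint subgraphs."""
    out = empty_result(k)
    for i in range(1, k + 1):
        for j in range(i + 1):              # j nodes from d1, i - j from d2
            left, right = d1[j], d2[i - j]
            if left is None or right is None:
                continue                    # this split is not feasible
            total = left[0] + right[0]
            if out[i] is None or total > out[i][0]:
                out[i] = (total, left[1] | right[1])
    return out

# Hypothetical inputs: d1 has solutions of size 1 and 2, d2 only of size 1.
d1 = [(0.0, frozenset()), (10.0, frozenset({"a"})), (15.0, frozenset({"a", "b"})), None]
d2 = [(0.0, frozenset()), (8.0, frozenset({"x"})), None, None]
combined = combine_disjoint(d1, d2, 3)
```

Here `combined[2]` pairs `a` with `x` (score 18) rather than taking both nodes from `d1` (score 15), and the double loop makes the O(k^2) cost visible.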

The ⊗ Operator: Similar to the ⊕ operator, the ⊗ operator
takes two operands, search result D' and search result D".
For 1 ≤ i ≤ k, D.solution_i is the solution of size i that is the
better of D'.solution_i and D".solution_i. The algorithm to compute
D = D' ⊗ D" is shown in Algorithm 6. It calculates D.solution_i
one by one for 1 ≤ i ≤ k (line 2). For a certain i, D.solution_i is
set to D'.solution_i if D'.score_i > D".score_i, and is set to
D".solution_i otherwise (lines 4-9). The time complexity of
Algorithm 6 is O(k). The ⊗ operator is suitable for two search
results that are generated from the same subgraph. The ⊗ operator
will be used and discussed in the next section. The ⊗ operator has
the following two properties.
(Commutative law) D ⊗ D' = D' ⊗ D.
(Associative law) (D ⊗ D') ⊗ D" = D ⊗ (D' ⊗ D").
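
A corresponding sketch of the ⊗ operator is given below. It again assumes, for illustration only, that a search result is a list indexed by solution size whose entries are (score, node set) pairs or `None`; the name `best_of` is hypothetical. As in Algorithm 6, the first operand wins only when its score is strictly larger.

```python
def best_of(d1, d2, k):
    """Sketch of D = D' ⊗ D'': keep, for each size, the better of two results."""
    out = [(0.0, frozenset())] + [None] * k
    for i in range(1, k + 1):
        first, second = d1[i], d2[i]
        if first is not None and (second is None or first[0] > second[0]):
            out[i] = first
        else:
            out[i] = second                 # may still be None if neither exists
    return out

# Hypothetical results computed for the same subgraph under two assumptions.
excluded = [(0.0, frozenset()), (6.0, frozenset({"b"})), (11.0, frozenset({"b", "c"}))]
included = [(0.0, frozenset()), (10.0, frozenset({"v"})), None]
merged = best_of(excluded, included, 2)
```

The single pass over the sizes 1..k corresponds to the O(k) cost stated above.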

The Overall Approach: The overall approach to compute D is 
shown in Algorithm 7. We find the set of disconnected components 
Algorithm 7 div-dp(G, k)

Input: The diversity graph G, the top-k value.
Output: Search result D.

1: D ← ∅;
2: let C = {G1, G2, ...} be the set of connected components of G;
3: for all Gi ∈ C do
4:     D' ← div-astar(Gi, k);
5:     D ← D ⊕ D';
6: return D;





Figure 6: A sample diversity graph 

C of G (line 2). We then process each Gi ∈ C individually using
div-astar search (line 4). By the commutative and associative
laws of the ⊕ operator, the search results for the subgraphs in C can be
combined with D in an arbitrary order using operator ⊕ (line 5).

Example 3 Consider the diversity graph G shown in Fig. 6, which
contains two connected components G1 and G2. Suppose k = 5,
and suppose the results D1 of G1 and D2 of G2 have been computed
separately using the div-astar algorithm. We now combine D1
and D2 to compute the result D for G. Suppose we now compute
D.solution_5 using the ⊕ operator shown in Algorithm 5. If
we select 1 node from G1 and 4 nodes from G2, we get a
score of 0 since D2.solution_4 = ∅. If we select 2 nodes from G1 and
3 nodes from G2, we get a score of 40, and if we select 3 nodes
from G1 and 2 nodes from G2, we get a score of 38. We search all
the possible combinations and select the best of them, which is 40,
with 2 nodes from G1 and 3 nodes from G2, as the best solution
D.solution_5. The operation ⊕ to combine D1 and D2 is shown in
Fig. 7.
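
The overall div-dp flow can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the graph is a dictionary mapping each node to the set of its neighbours, `search_component` is an exhaustive stand-in for div-astar (practical only for tiny components), and all function names are illustrative.

```python
from itertools import combinations

def components(adj):
    """Connected components of an undirected graph given as node -> neighbour set."""
    seen, comps = set(), []
    for start in adj:
        if start in seen:
            continue
        comp, stack = set(), [start]
        seen.add(start)
        while stack:
            u = stack.pop()
            comp.add(u)
            for w in adj[u]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        comps.append(comp)
    return comps

def search_component(nodes, adj, score, k):
    """Exhaustive stand-in for div-astar: best independent set of each size."""
    out = [(0, frozenset())] + [None] * k
    for size in range(1, min(k, len(nodes)) + 1):
        for cand in combinations(sorted(nodes), size):
            if any(b in adj[a] for a, b in combinations(cand, 2)):
                continue                      # two chosen results are similar
            total = sum(score[v] for v in cand)
            if out[size] is None or total > out[size][0]:
                out[size] = (total, frozenset(cand))
    return out

def combine_disjoint(d1, d2, k):
    """The ⊕ operator: j nodes from d1 and i - j nodes from d2."""
    out = [(0, frozenset())] + [None] * k
    for i in range(1, k + 1):
        for j in range(i + 1):
            if d1[j] is None or d2[i - j] is None:
                continue
            total = d1[j][0] + d2[i - j][0]
            if out[i] is None or total > out[i][0]:
                out[i] = (total, d1[j][1] | d2[i - j][1])
    return out

def div_dp(adj, score, k):
    """Search every component separately, then fold the results with ⊕."""
    result = [(0, frozenset())] + [None] * k
    for comp in components(adj):
        result = combine_disjoint(result, search_component(comp, adj, score, k), k)
    return result

# Hypothetical graph: a path a-b-c and a separate edge d-e.
adj = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}, "d": {"e"}, "e": {"d"}}
score = {"a": 5, "b": 9, "c": 4, "d": 7, "e": 6}
result = div_dp(adj, score, 3)
```

Because ⊕ is commutative and associative, the order in which the components are folded does not change the scores.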

7. A CUT POINT BASED APPROACH 

The dynamic programming based approach divides the diversity
graph G into components so that each component can be searched
separately. When one of the components is large, searching that single
component is still costly. Consider a certain connected
component Gi. Although it is connected, it may contain several
subgraphs that are loosely connected, i.e., connected through a set
of cut points, where a cut point of a graph Gi is a single node
v ∈ V(Gi) such that Gi becomes disconnected if v is removed from Gi. In
this section, we show that the subgraphs connected through some
of the cut points can be considered separately by applying div-astar
search at most 4 times under different assumptions, and that their search
results can be combined using a series of ⊕ and ⊗ operations.

The Cut Point Tree (cptree): Given a connected graph G, the cut
point tree (cptree) of G is a tree formed by a subset of the cut points
of G. Each node o of the cptree has the form o = (o.cut-point,
o.entry-graph, o.left-graph, o.subnodes, o.result). o.cut-point is
the cut point that the node represents. o.entry-graph
is the subgraph of G that connects o.cut-point and the cut point
of the father node of o in the cptree. If there is more than one such
graph, then o.entry-graph is a disconnected graph that contains
Figure 7: Dynamic programming using the ⊕ operator

Algorithm 8 div-cut(G, k)

Input: The diversity graph G, the top-k value.
Output: Search result D.

1: D ← ∅;
2: let C = {G1, G2, ...} be the set of connected components of G;
3: for all Gi ∈ C do
4:     cp-compress(Gi);
5:     if cut-points(Gi) = ∅ then
6:         D' ← div-astar(Gi, k);
7:     else
8:         o ← cptree-construct(Gi, cut-points(Gi));
9:         cp-search(Gi, o, k);
10:        D' ← o.result_0 ⊗ o.result_1;
11:    D ← D ⊕ D';
12: return D;



all of them. It is possible that o.entry-graph = ∅. o.left-graph
is the graph that does not contain any cut point after removing
o.cut-point from G. If there is more than one such graph, then
o.left-graph is a disconnected graph that contains all of them. It
is possible that o.left-graph = ∅. o.subnodes is the set of subnodes
of o in the cptree. o.result contains two results, o.result_0
and o.result_1. o.result_0 is the search result for the subtree rooted
at o such that o.cut-point is excluded, and o.result_1 is the search
result for the subtree rooted at o such that o.cut-point is included.
We use o.cut-point to denote o if the context is obvious.

Graph Compression: In order to increase the number of cut points
in a graph, and thus reduce the size of the sub-components obtained after
removing some of the cut points, we study how to compress a
graph G. By compression, we mean that some nodes can be removed
from the graph if the final solution D on G is not influenced. The
following lemma shows how to compress a graph G.



Lemma 7 Given the diversity graph G, a node v_i can be removed
from G if there exists a node v_j that satisfies the following three
conditions.

1) v_j ∈ v_i.adj(G).

2) score(v_j) ≥ score(v_i).

3) v_j.adj(G) ⊆ v_i.adj(G) ∪ {v_i}.

After removing v_i, the optimal solution on the new graph is the
same as the optimal solution on the original graph.

Proof Sketch: We prove that for any solution V that contains v_i,
we can obtain a solution by replacing v_i with v_j such that the score
is not decreased. First, we prove that after replacing v_i with v_j, the
solution is still feasible. Since v_i and v_j are adjacent (the first
condition), v_j cannot be contained in the original solution. Since
each node adjacent to v_j is also adjacent to v_i (the third condition),
and no node in V is adjacent to v_i, after replacing
v_i with v_j in V there is still no node in V that is adjacent
to v_j. Thus, the new solution is still a feasible solution. Since
score(v_j) ≥ score(v_i) (the second condition), the score of the new
solution is no smaller than the score of the original solution V. □
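
The three conditions of Lemma 7 translate directly into a small compression routine. The sketch below assumes the graph is a dictionary mapping each node to the set of its neighbours and that scores are given in a second dictionary; the name `compress` is illustrative and is not claimed to match the cp-compress procedure of Algorithm 8. It removes one dominated node at a time and re-checks the conditions on the reduced graph.

```python
def compress(adj, score):
    """Repeatedly remove a node v_i that is dominated by a neighbour v_j (Lemma 7)."""
    adj = {v: set(ns) for v, ns in adj.items()}   # work on a copy
    changed = True
    while changed:
        changed = False
        for vi in list(adj):
            for vj in adj[vi]:                                 # condition 1
                if (score[vj] >= score[vi]                     # condition 2
                        and adj[vj] <= adj[vi] | {vi}):        # condition 3
                    for w in adj[vi]:
                        adj[w].discard(vi)        # detach vi from its neighbours
                    del adj[vi]
                    changed = True
                    break
            if changed:
                break                             # restart on the reduced graph
    return adj

# Hypothetical star: b is adjacent to a, c and d, and every leaf scores higher.
star = {"a": {"b"}, "b": {"a", "c", "d"}, "c": {"b"}, "d": {"b"}}
star_scores = {"a": 5, "b": 3, "c": 4, "d": 6}
compressed = compress(star, star_scores)
```

In the star example the centre `b` is removed, which splits the graph into three isolated nodes. Restarting the scan after each removal keeps the sketch simple at the cost of extra passes.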
Algorithm 9 cptree-construct(G, cut-points)

Input: Graph G, a set of cut points cut-points.
Output: The root cpnode o of the cptree.

1: cpnode o ← ∅;
2: v ← a node in cut-points with smallest entry-graph(v);
3: o.cut-point ← v;
4: o.entry-graph ← entry-graph(v);
5: let C = {G1, G2, ...} be the connected components after removing
   v and entry-graph(v) from G;
6: for all Gi ∈ C do
7:     if cut-points ∩ V(Gi) ≠ ∅ then
8:         o' ← cptree-construct(Gi, cut-points ∩ V(Gi));
9:         o.subnodes ← o.subnodes ∪ {o'};
10:    else
11:        o.left-graph ← o.left-graph ∪ Gi;
12: return o;



Algorithm 10 cp-search(G, o, k)

Input: Graph G, the cpnode o of the cptree, the top-k value.
Output: The search results o.result_0 and o.result_1.

1: for all o' ∈ o.subnodes do
2:     cp-search(G, o', k);
3: for i = 0 to 1 do
4:     if i = 1 then
5:         mark(o.cut-point.adj(G));
6:     o.result_i ← div-cut(remove-mark(o.left-graph), k);
7:     for all o' ∈ o.subnodes do
8:         D ← ∅;
9:         for j = 0 to 1 do
10:            if j = 1 and i = 1 and o'.cut-point ∈ o.cut-point.adj(G) then
11:                break;
12:            if j = 1 then
13:                mark(o'.cut-point.adj(G));
14:            D' ← div-cut(remove-mark(o'.entry-graph), k);
15:            D' ← o'.result_j ⊕ D';
16:            D ← D ⊗ D';
17:            if j = 1 then
18:                unmark(o'.cut-point.adj(G));
19:        o.result_i ← o.result_i ⊕ D;
20:    if i = 1 then
21:        update o.result_1 by adding node o.cut-point into every solution
           in o.result_1;
22:        unmark(o.cut-point.adj(G));



Example 4 Fig. 8 shows a sample diversity graph. In the graph,
w1 can be removed since there exists a node w2 that is adjacent to w1,
and every node adjacent to w2 is also adjacent to w1. After
compression, the new graph is shown in Fig. 9. The cptree of the new
graph is shown in the leftmost part of Fig. 11, where there are 3
nodes, w2, w4 and w5, with root w2. The w4.entry-graph is G'1,
which is marked in Fig. 9, and the left graph of w4 is a graph that
contains only one node WJ3. 

Solution Overview: The cut point based solution is outlined in
Algorithm 8. Similar to Algorithm 7, it first decomposes the diversity
graph G into disconnected components (line 2). For each component
Gi, we first compress it by removing nodes based on Lemma
7 (line 4). If there are no cut points, we simply search Gi using
div-astar (Algorithm 4). Otherwise, we construct the cptree with
root o for Gi and search the cptree from the root node o to calculate
o.result_0 and o.result_1 (lines 8-9). o.result_0 and o.result_1 are
combined using the ⊗ operator since they are for the same subset of
nodes (line 10). The results for different components are combined
using the ⊕ operator since they are for different subsets of nodes in




Figure 9: Compressed diversity graph 

G (line 11). We discuss how to construct the cptree and how to 
search the cptree below. 

Constructing the cptree: Given a diversity graph G, the cptree for
G is constructed as follows. First, the set of cut points, cut-points,
is computed using Tarjan's algorithm in time linear in the
size of G. Then each subtree of the cptree is constructed recursively
based on a certain subgraph G'. The root node v of the subtree
is selected as follows: if v is the root of the whole cptree, v is
the node in cut-points such that, after removing v, the maximum
component of G is minimized. Otherwise, v is selected as the
node in cut-points such that, after removing v from G', the size
of the component that is connected to v's father node in the
original graph G, denoted entry-graph(v), is maximized. The other
components in G' after removing v can be divided into two
categories. The first category includes components with no node
in cut-points. Such components are added to the left-graph of v.
The other category includes components with at least one node in
cut-points. Each such component is considered as a subtree of
v in the cptree and is created recursively. The algorithm for
constructing the cptree is shown in Algorithm 9.
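
The first step, finding the cut points, can be sketched with the standard depth-first low-link technique attributed to Tarjan. The sketch below is a generic textbook formulation, not code from the paper; it assumes a simple undirected graph stored as a dictionary of neighbour sets, and the name `cut_points` is illustrative. It is recursive, so very deep graphs would need an iterative variant.

```python
def cut_points(adj):
    """Articulation points of an undirected simple graph (node -> neighbour set)."""
    disc, low, found = {}, {}, set()
    counter = [0]

    def dfs(u, parent):
        disc[u] = low[u] = counter[0]
        counter[0] += 1
        children = 0
        for w in adj[u]:
            if w not in disc:
                children += 1
                dfs(w, u)
                low[u] = min(low[u], low[w])
                # A non-root u is a cut point if w's subtree cannot reach above u.
                if parent is not None and low[w] >= disc[u]:
                    found.add(u)
            elif w != parent:
                low[u] = min(low[u], disc[w])   # back edge to an ancestor
        # The DFS root is a cut point only if it has several DFS children.
        if parent is None and children > 1:
            found.add(u)

    for start in adj:
        if start not in disc:
            dfs(start, None)
    return found

# Hypothetical graph: a triangle a-b-c with a tail node d attached to c.
tailed = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
```

For `tailed`, only `c` separates the graph when removed. Each node and edge is visited a constant number of times, which gives the linear running time mentioned above.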

Searching the cptree: The aim of searching the cptree is to
compute o.result_0 and o.result_1 for every node o on the cptree in a
bottom-up fashion. For a certain node o on the cptree, suppose
result_0 and result_1 have been computed for all of o's subnodes; we
then need to compute o.result_0 and o.result_1. The algorithm to search
the cptree is shown in Algorithm 10.

We explain Algorithm 10 using an example. Fig. 10 shows a
cptree with 3 nodes, o12, o34 and o24, connecting 4 graphs G1, G2,
G3 and G4. G34 consists of G3, G4 and o34. G12 consists of
G1, G2 and o12, and G consists of G12, G34 and o24. For
simplicity, in this example, we use the graph itself to denote the search
result on the graph. For a cut point o on a graph G, we use
G.include_o to denote the optimal solution on G in which o is included,
and use G.exclude_o to denote the optimal solution on G in which o is
excluded. Suppose G3.include_{o34}, G3.exclude_{o34}, G1.include_{o12},
and G1.exclude_{o12} have been computed. We show how to compute
G.include_{o24} and G.exclude_{o24}.

Computing G.exclude_{o24}: It is the case for i = 0 in Algorithm 10
(line 3). Since o24 is excluded, we have G.exclude_{o24} = G34 ⊕
Figure 10: cptree search example

G12 (line 19 in the for loop from line 7). We now discuss how to
compute G12; G34 can be computed in a similar way. There
are two situations:

1. o12 is excluded. It is the case for j = 0 (line 9). In such
a situation, we can compute G'12 = G1.exclude_{o12} ⊕ G2
(line 15).

2. o12 is included. It is the case for j = 1 (line 9). In such a
situation, we can compute G"12 = G1.include_{o12} ⊕ (G2 -
o12.adj(G)) (line 15), where G2 - o12.adj(G) is G2 with
the nodes adjacent to o12 removed (line 13).

After computing 1 and 2, G12 can be computed as G12 = G'12 ⊗
G"12 (line 16).

Computing G.include_{o24}: It is the case for i = 1 in Algorithm
10 (line 3). Since o24 is included, we should remove the nodes that
are adjacent to o24 in G4 and G2 (line 5). Thus G.include_{o24} =
(G34 - o24.adj(G)) ⊕ (G12 - o24.adj(G)). Note that since o24
is included, we should also add o24 into G.include_{o24} after
computing the ⊕ operation (line 21). We now discuss how to compute
(G12 - o24.adj(G)); (G34 - o24.adj(G)) can be computed in
a similar way. There are two situations:

1. o12 is excluded. It is the case for j = 0 (line 9). In such a
situation, we can compute G'12 = G1.exclude_{o12} ⊕ (G2 -
o24.adj(G)) (line 15).

2. o12 is included. It is the case for j = 1 (line 9). In such a
situation, we can compute G"12 = G1.include_{o12} ⊕ (G2 -
o24.adj(G) - o12.adj(G)) (line 15), where G2 - o24.adj(G) -
o12.adj(G) is G2 - o24.adj(G) with the nodes adjacent to o12
removed (line 13).

After computing (1) and (2), (G12 - o24.adj(G)) can be computed
as (G12 - o24.adj(G)) = G'12 ⊗ G"12 (line 16).

From the above discussion, on the cptree, for each node o',
o'.entry-graph needs to be searched at most four times,
depending on whether o.cut-point and o'.cut-point are included or
not, where o is the father node of o' on the cptree. For the above
example, G2 is searched four times, using G2, (G2 - o12.adj(G)),
(G2 - o24.adj(G)) and (G2 - o24.adj(G) - o12.adj(G))
respectively. There are two more cases that the above example does
not consider. First, when o.left-graph is not ∅, the search result on
o.left-graph should also be combined into o.result_0 and o.result_1
using operator ⊕ (line 6). Second, when o and o' are adjacent, and
both o and o' are included, it is not a feasible solution (lines 10-11).

Example 5 For the diversity graph shown in Fig. 8, suppose
k = 5. After graph compression (Fig. 9), the cptree is shown
in the leftmost part of Fig. 11. For the root node w2, suppose
the solutions for its subnodes w4 and w5 have been computed. The
optimal solution w2.result_0 (excluding w2) is computed by combining two
Figure 11: Search the cptree
Figure 12: Keyword queries

results using operator ⊕: (1) the optimal solution for G'1 ∪ the
subgraph represented by w4, and (2) the optimal solution for G'2 ∪
the subgraph represented by w5. When computing (1), we should
combine the following two results using the operator ⊗: (a) the
result by including w4, and (b) the result by excluding w4. (a) can
be computed by first removing all nodes from G'1 that are adjacent
to w4, and then combining the optimal solution for G'1 and w4.result_1
using operator ⊕. The optimal solution for G'1 can be computed
using div-cut again. (b) can be computed in a similar way. The
optimal solution w2.result_0 is shown in the middle lower part of
Fig. 11, and the optimal solution w2.result_1 is shown in the middle
upper part of Fig. 11. After combining w2.result_0 and w2.result_1
using operator ⊗, the final solution is shown in the right part of
Fig. 11.
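
The include/exclude case split around a cut point can be illustrated with a deliberately reduced Python sketch. It handles a single cut point v only, with no cptree, entry graphs or left graphs, and uses exhaustive search in place of div-astar. The list-of-(score, node set) result layout and all function names are assumptions made for this illustration.

```python
from itertools import combinations

def brute(nodes, adj, score, k):
    """Best independent set of each size 0..k inside `nodes` (exhaustive)."""
    out = [(0, frozenset())] + [None] * k
    for size in range(1, min(k, len(nodes)) + 1):
        for cand in combinations(sorted(nodes), size):
            if any(b in adj[a] for a, b in combinations(cand, 2)):
                continue
            total = sum(score[x] for x in cand)
            if out[size] is None or total > out[size][0]:
                out[size] = (total, frozenset(cand))
    return out

def oplus(d1, d2, k):
    """⊕: combine results over disjoint node sets."""
    out = [(0, frozenset())] + [None] * k
    for i in range(1, k + 1):
        for j in range(i + 1):
            if d1[j] is None or d2[i - j] is None:
                continue
            total = d1[j][0] + d2[i - j][0]
            if out[i] is None or total > out[i][0]:
                out[i] = (total, d1[j][1] | d2[i - j][1])
    return out

def otimes(d1, d2, k):
    """⊗: keep the better result for each size."""
    out = [(0, frozenset())] + [None] * k
    for i in range(1, k + 1):
        options = [d for d in (d1[i], d2[i]) if d is not None]
        if options:
            out[i] = max(options, key=lambda d: d[0])
    return out

def pieces(nodes, adj):
    """Connected components of the subgraph induced by `nodes`."""
    seen, comps = set(), []
    for start in sorted(nodes):
        if start in seen:
            continue
        comp, stack = set(), [start]
        seen.add(start)
        while stack:
            u = stack.pop()
            comp.add(u)
            for w in adj[u] & nodes:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        comps.append(comp)
    return comps

def split_on_cut_point(adj, score, v, k):
    """Search each side of cut point v twice: v excluded, and v included."""
    excluded = inner = [(0, frozenset())] + [None] * k
    for comp in pieces(set(adj) - {v}, adj):
        excluded = oplus(excluded, brute(comp, adj, score, k), k)
        inner = oplus(inner, brute(comp - adj[v], adj, score, k), k)
    included = [None] * (k + 1)
    for i in range(1, k + 1):               # one slot is taken by v itself
        if inner[i - 1] is not None:
            included[i] = (inner[i - 1][0] + score[v], inner[i - 1][1] | {v})
    return otimes(excluded, included, k)

# Hypothetical graph: edges a-b, a-v, v-c, c-d, with v as the cut point.
adj = {"a": {"b", "v"}, "b": {"a"}, "v": {"a", "c"}, "c": {"v", "d"}, "d": {"c"}}
score = {"a": 4, "b": 6, "v": 10, "c": 5, "d": 3}
res = split_on_cut_point(adj, score, "v", 3)
```

Each side of the cut point is searched once with v excluded and once with the neighbours of v removed; the sides are merged with ⊕ and the two cases with ⊗, mirroring the roles of lines 15, 16 and 19 of Algorithm 10.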



8. PERFORMANCE STUDIES 

We conducted extensive performance studies to test the algo- 
rithms proposed in this paper. We implemented three algorithms, 
denoted div-astar (Algorithm 4), div-dp (Algorithm 7), and div-cut 
(Algorithm 8) that follow the framework shown in Algorithm 3 
with different implementations on div-search-current. All algo- 
rithms were implemented in Visual C++ 2008 and all tests were 
conducted on a 2.8GHz CPU and 2GB memory PC running Win- 
dows XP. 

We use two real datasets, enwiki² and reuters³. enwiki includes
11,930,681 articles from the English Wikipedia, and reuters
includes 21,578 news articles from Reuters. Given a user keyword query q,
we search the top-k documents using the TF*IDF score
normalized by the length of the corresponding document, which is defined
as follows for each document d.



$$\mathrm{score}(q, d) = \frac{\sum_{q_i \in q} \mathrm{tf}(q_i, d) \times \mathrm{idf}(q_i)}{\sqrt{\mathrm{len}(d)}} \tag{3}$$



where tf(q_i, d) is the term frequency of keyword q_i in d, idf(q_i) is
the inverted document frequency of keyword q_i, which is defined
as idf(q_i) = log(|D| / (|{d ∈ D : q_i ∈ d}| + 1)) for the dataset D, and len(d) is
the total number of words in d. Given any two documents d1 and
d2, suppose all stop words have been removed from d1 and d2; the



² http://en.wikipedia.org/wiki/Wikipedia:Database_download
³ http://kdd.ics.uci.edu/databases/reuters21578/
similarity for d1 and d2 is defined as follows, based on the weighted
Jaccard distance.

$$\mathrm{sim}(d_1, d_2) = \frac{\sum_{w \in d_1 \cap d_2} \mathrm{idf}(w)}{\sum_{w \in d_1 \cup d_2} \mathrm{idf}(w)} \tag{4}$$

where d1 ∩ d2 is the multi-set of words that appear in both d1 and
d2, and d1 ∪ d2 is the multi-set of words that appear in either d1 or
d2.
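
The scoring and similarity functions can be sketched in Python as follows. The sketch assumes documents and queries are lists of already tokenised words with stop words removed, that idf values are supplied in a precomputed dictionary, and that multi-set intersection and union take the minimum and maximum word counts respectively. The function names are illustrative, and documents are assumed to be non-empty.

```python
import math
from collections import Counter

def tfidf_score(query, doc, idf):
    """Length-normalised TF*IDF score of `doc` for `query` (Equation 3)."""
    tf = Counter(doc)
    total = sum(tf[q] * idf.get(q, 0.0) for q in query)
    return total / math.sqrt(len(doc))

def weighted_jaccard(doc1, doc2, idf):
    """idf-weighted Jaccard similarity of two documents (Equation 4)."""
    c1, c2 = Counter(doc1), Counter(doc2)
    common = c1 & c2          # multi-set intersection: minimum counts
    either = c1 | c2          # multi-set union: maximum counts
    num = sum(idf.get(w, 0.0) * n for w, n in common.items())
    den = sum(idf.get(w, 0.0) * n for w, n in either.items())
    return num / den if den else 0.0

# Hypothetical documents and idf weights.
idf = {"a": 1.0, "b": 2.0, "c": 3.0}
d1 = ["a", "b", "b"]
d2 = ["b", "c"]
similarity = weighted_jaccard(d1, d2, idf)
```

Two results would then be treated as similar, and joined by an edge in the diversity graph, when this value exceeds the chosen similarity threshold.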

For enwiki, we test the scalability for keyword queries with
multiple keywords, where the results for each keyword are sorted
according to their scores and stored in the inverted index. The results
for all keywords are aggregated using the threshold algorithm [8].
For reuters, we test the scalability for keyword queries where each
keyword query contains only one keyword. The results are
output incrementally by sequentially scanning the inverted index for
the keyword. For each test, we record the processing time and
the peak memory consumption. The processing time/peak memory
consumption is the total time/peak memory consumed in searching
the diversified top-k results. When all the 2GB memory is used up,
the algorithm cannot compute the diversified top-k results. We use
INF to denote such a situation.

For each dataset, we vary 3 parameters, k, τ and kfreq. k is
the top-k value, τ is the similarity threshold, and kfreq is the
average keyword frequency for the corresponding query. For each
dataset, we select representative queries with different keyword
frequencies as follows. After removing all the stop words, we
set the maximum keyword frequency among all keywords as π,
and divide the keyword frequency range between 0 and π into 5
partitions, namely, π/5, 2π/5, 3π/5, 4π/5 and π. For simplicity,
we say a keyword has frequency p (p ∈ {1, 2, 3, 4, 5}) iff
its frequency is between (p - 1) · π/5 and p · π/5. We also
set two groups of k values: the small k values and the large
k values. Since div-astar is not suitable when
k is large, in the large k value group we only compare the two
algorithms div-dp and div-cut. For enwiki, k is selected from
{40, 80, 120, 160, 200} with default value 120 for small k values
and selected from {500, 700, 900, 1300, 2000} with default value
900 for large k values, τ is selected from {0.4, 0.5, 0.6, 0.7, 0.8}
with default value 0.6, and kfreq ranges from 1 to 5 with default
value 3. For reuters, k is selected from {60, 80, 100, 110, 120}
with default value 100 for small k values and selected from {500,
700, 900, 1300, 2000} with default value 900 for large k values.
The small k values selected in reuters are different from those in
enwiki because div-astar can hardly handle queries when k is as
large as 200 in reuters. τ is selected from {0.4, 0.5, 0.6, 0.7, 0.8}
with default value 0.6, and kfreq ranges from 1 to 5 with default
value 3. When varying a certain parameter, the values of all the
other parameters are set to their default values. The set of keywords
with different kfreq are shown in Fig. 12.
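
The mapping from a raw keyword frequency to its partition p can be written as a one-line integer computation. The sketch assumes integer frequencies and treats each partition as open on the left and closed on the right, which the text does not state explicitly; the name `freq_bucket` is illustrative.

```python
def freq_bucket(freq: int, max_freq: int) -> int:
    """Partition p in 1..5 such that (p - 1) * max_freq / 5 < freq <= p * max_freq / 5."""
    if not 0 < freq <= max_freq:
        raise ValueError("freq must lie in (0, max_freq]")
    return (5 * freq + max_freq - 1) // max_freq   # integer ceiling of 5 * freq / max_freq

# With a hypothetical maximum frequency of 100:
buckets = [freq_bucket(f, 100) for f in (1, 20, 21, 60, 100)]
```

Integer arithmetic avoids floating-point rounding at the partition boundaries.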

Exp-1 (Test enwiki): The testing results on the enwiki dataset
when varying k are shown in Fig. 13. Fig. 13 (a) and Fig. 13
(b) show the processing time and memory consumption when k is
small. When k increases, the processing time of all three
algorithms increases; div-astar increases
sharply, while div-dp and div-cut remain nearly stable. When k reaches 200,
div-astar takes more than 200 seconds and consumes more than
200MB memory, while both div-cut and div-dp take less than
0.1 seconds and consume less than 10KB memory. The processing
time and memory consumption for large k values are shown in
Fig. 13 (c) and Fig. 13 (d) respectively. When k increases, the time
and memory consumption of div-dp increase sharply while the
time and memory consumption of div-cut increase slowly. This
is because, when k increases, the size of the largest component of
the diversity graph increases, but it can still be decomposed into
relatively smaller subgraphs using cut points. When k reaches 2,000,
div-dp cannot generate the result, whereas div-cut can compute the
optimal solution within 15 seconds using less than 200KB memory.

When τ varies from 0.4 to 0.8, the testing results on the enwiki
dataset are shown in Fig. 14. As shown in Fig. 14 (a) and Fig. 14
(b), for small k, when τ increases, the processing time and memory
consumption of div-astar decrease, and the time and memory
consumption of div-cut and div-dp remain stable. This is because
when τ is large, a large number of search results are not similar to
each other. Thus in the div-astar search, the estimated upper bound
is tight and the algorithm stops early. For large k values, as
shown in Fig. 14 (c) and Fig. 14 (d), when τ increases, the time
and memory consumption of both div-cut and div-dp decrease.
When τ is 0.4, div-dp cannot generate a result because when τ is
small, a lot of search results are similar to each other, and thus the
largest component of the diversity graph is large. div-cut can
compute the optimal solution within 15 seconds using less than 1MB
memory.

Fig. 15 shows the testing results on the enwiki dataset when varying
kfreq. When kfreq increases, neither the processing time nor the
memory consumption shows an obvious trend to increase or
decrease. This is because whether two search results are similar to
each other does not depend largely on the keyword frequency of
the query. Fig. 15 (a) and Fig. 15 (b) show the processing time and
memory consumption for small k values. div-cut and div-dp have
similar performance, and div-astar is more than 100 times slower
and consumes 1000 times more memory compared to div-cut and
div-dp in all cases. The processing time and memory consumption
for large k values when varying kfreq are shown in Fig. 15 (c)
and Fig. 15 (d) respectively. div-dp is more than 2 times slower and
consumes 10 times more memory compared to div-cut in all cases.
When kfreq = 2, div-dp cannot generate a result and div-cut can
finish in less than 10 seconds using less than 500KB memory.

Exp-2 (Test reuters): The testing results when varying k in the
reuters dataset are shown in Fig. 16. Fig. 16 (a) and Fig. 16 (b)
show that, for small k values, when k increases, the processing
time and memory consumption of div-cut and div-dp remain stable
and the time/memory consumption of div-astar increases sharply.
When k reaches 120, div-astar cannot generate a result, while div-cut
and div-dp can finish in less than 0.1 seconds using less than 1KB
memory. For large k values, as shown in Fig. 16 (c) and Fig. 16 (d),
when k is less than 900, div-cut and div-dp have almost the same
performance. When k = 500, div-dp even consumes less memory
than div-cut. This is because when k is small, the components
of the diversity graph are all small, so both div-cut and div-dp
consume little memory, but div-cut needs extra space to store the
cptree. When k is as large as 2000, div-dp is 10 times slower than
div-cut and consumes more than 1000 times more memory.

The curves for reuters when varying τ are shown in Fig. 17.
As shown in Fig. 17 (a) and Fig. 17 (b), for small k values, when
τ increases, the time/memory consumption of div-astar decreases
sharply and div-cut and div-dp remain stable. When τ < 0.5, the
div-astar algorithm cannot generate a result while div-dp and
div-cut can compute the optimal solution within 0.01 seconds
using more than 1KB memory. The results for large k values are
shown in Fig. 17 (c) and Fig. 17 (d). When τ is larger than 0.6,
div-cut and div-dp have almost the same performance. When τ
decreases, the time/memory consumption of div-dp increases sharply
and div-cut remains stable. For τ = 0.4, div-dp can compute the
optimal solution in more than 500 seconds using more than 500MB
memory, while div-cut can compute the optimal solution in less
than 5 seconds using less than 200KB memory.

Fig. 18 shows the testing results when varying kfreq in the reuters
dataset. Again, the time/memory consumption does not have an
obvious trend to increase or decrease. For small k values, as shown
in Fig. 18, div-cut and div-dp have similar processing
Figure 18: Vary kfreq (reuters)

time, but div-dp consumes more memory than div-cut. div-astar is
much slower and consumes much more memory than div-cut and
div-dp in all cases. For large k values, as shown in Fig. 18, the gap
between div-dp and div-cut is not as large as that in the enwiki
dataset. This is because in the enwiki dataset, the number of
documents is large, and thus documents that fall into the same category
are similar to each other with high probability, whereas in the reuters
dataset, the number of documents is small, and thus the probability
that two documents are similar to each other is small. div-cut is
faster and consumes less memory than div-dp in all cases.

9. RELATED WORK 

In general, our problem of finding diversified top-k results is re- 
lated to three problems in the literature, namely, traditional top-k 
query, diversified top-k query, and maximum weight independent 
set. 

Traditional top-k query: In a top-k query, each result is associated
with a score and the k results with the largest scores are reported as the
top-k results. General techniques to answer top-k queries fall
into two categories.

In the first category, all the results define a solution space. The
approach recursively partitions the solution space into subspaces
based on the best result in the current subspace, and the next best
result is the one with the largest score among the best results of the
subspaces. Therefore, the top-k results are generated one by one using
Lawler's procedure [16], and the approaches based on it are in [15,
20, 14]. Lawler [16] proposes a general procedure for computing
top-k results for discrete optimization problems and also discusses
its application to the k shortest path problem. Based on Lawler's
procedure, Kimelfeld and Sagiv [15] study how to find top-k Steiner
trees, Qin et al. [20] focus on finding top-k communities, and Kargar
and An [14] find r-cliques in a graph.

In the other category, the results are generated in a heuristic or- 
der, and an upper bound score is computed for all the ungenerated 
results. The algorithm stops if the scores of current top-k results are 
no smaller than the upper bound score. A detailed survey can be
found in [13]. The general setting is that one ranked list is defined
for each query feature, and the score of each result is an aggregation
of the scores corresponding to the query features, which is a classical
setting in Information Retrieval. The most influential algorithm is
proposed by Fagin et al. [7, 9], which considers random access
and/or sequential access of the ranked lists. Other works consider
the scenario in which only sequential accesses of the ranked lists are
allowed [12, 17].

Diversified top-k query: A traditional top-k query
returns results based on their relevance scores only. More and more
works propose to take diversity (or redundancy) into consideration
in order to return a more satisfying result list for user queries [2, 1, 23,
22, 11]. The problems studied about diversity span a wide
spectrum, e.g., diversified keyword search in documents [2],
diversified prestige node finding in information networks [18],
diversified keyword query recommendation [23], diversified document
monitoring in information filtering systems [22], diversified
keyword query interpretation over structured databases [6], and
diversified keyword search over graphs [11].

In general, for a diversity-aware query, the results are returned 
in an ordered list and a redundancy value is computed for each 
result based on the content of the results preceding it. Then, a 
usefulness score is computed by combining the relevance score and 
the redundancy value, and results are ranked with respect to the 
usefulness score [2, 1, 6, 11]. The approaches that find top-k 
diversified results by usefulness scores generally consist of two 
steps: they first compute the top-l (l > k) results based on the 
relevance score only and then rerank the l results based on the 
usefulness score using a greedy algorithm [6, 1, 11, 5]. To improve 
efficiency, Angel and Koudas [2] propose a one-step approach to 
answering diversified top-k queries. They couple their algorithm 
with the threshold algorithm, which is designed for traditional 
top-k queries [9]. An upper bound usefulness score is computed for 
the non-retrieved documents, and the current k documents with the 
largest scores are the top-k results if their scores are no smaller 
than the computed upper bound. 
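
The two-step scheme can be sketched as follows. The cited works each 
define their own usefulness score; the combination used here 
(relevance minus a penalty times the maximum similarity to the results 
already chosen) is an assumption made only for illustration, and the 
names greedy_rerank and penalty are hypothetical. 

```python
def greedy_rerank(relevance, similarity, k, l, penalty=0.5):
    """relevance: dict item -> relevance score.
    similarity: function (item, item) -> value in [0, 1].
    Returns an ordered list of at most k items."""
    # Step 1: keep the l most relevant results (l > k).
    candidates = sorted(relevance, key=relevance.get, reverse=True)[:l]
    chosen = []
    # Step 2: greedily append the candidate with the best usefulness score.
    while candidates and len(chosen) < k:
        def usefulness(item):
            # Redundancy depends only on the results preceding the item.
            redundancy = max((similarity(item, c) for c in chosen), default=0.0)
            return relevance[item] - penalty * redundancy
        best = max(candidates, key=usefulness)
        chosen.append(best)
        candidates.remove(best)
    return chosen
```

The greedy step commits to the most relevant result first and never 
revisits that choice, so the list it returns is not guaranteed to 
maximize the total score over all diverse result sets. 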

Unlike the above works, Zhang et al. [22] treat the redundancy of a 
document with respect to a set of relevant documents as a binary 
value, i.e., a document is either redundant or should be reported as 
relevant. Mei et al. [18] find the top-k diversified prestige nodes 
in information networks using a vertex-reinforced random walk. Zhu et 
al. [23] recommend top-k diversified relevant queries using a 
manifold-based approach. 

Maximum weight independent set: Our problem can be viewed as an 
instance of finding a maximum weight independent set constrained to 
size k, which is NP-hard [10]. The problems of finding a maximum 
weight independent set, a maximum weight clique, and a minimum weight 
vertex cover are all closely related, and these problems are hard to 
approximate. Therefore, very few attempts have been made in the 
literature to find exact solutions, apart from the branch-and-bound 
methods [21, 3, 19, 4]. However, these works do not consider the size 
constraint k introduced in our problem. Also, in our problem, the 
diversity graph is not totally materialized. 
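
To make this formulation concrete, the sketch below enumerates all 
vertex subsets of a tiny, fully materialized diversity graph. It is an 
exhaustive baseline with exponential cost, not one of the algorithms 
proposed in this paper. It assumes non-negative scores and reads the 
size constraint as "at most k vertices"; the name 
best_independent_set is hypothetical. 

```python
from itertools import combinations

def best_independent_set(weights, edges, k):
    """weights: dict vertex -> score.
    edges: iterable of (u, v) pairs connecting similar results.
    Returns (total weight, vertices) for a best independent set of size <= k."""
    adjacent = {frozenset(e) for e in edges}
    best = (0, ())
    for size in range(1, k + 1):
        for subset in combinations(sorted(weights), size):
            # Independent: no two chosen vertices share an edge.
            if any(frozenset(p) in adjacent for p in combinations(subset, 2)):
                continue
            total = sum(weights[v] for v in subset)
            if total > best[0]:
                best = (total, subset)
    return best
```

With weights a=10, b=9, c=8, d=1, edges (a, b) and (a, c), and k=2, 
the best set is {b, c} with total 17, whereas choosing the top-scored 
vertex a first leaves only d and yields 11. 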

10. CONCLUSION 

In this paper, we study the diversified top-k search problem, 
which takes both the scores of results and their diversity into 
consideration. We formally define the problem using the similarity of 
the search results themselves. We propose a framework such that most 
existing solutions that handle top-k query processing with early stop 
can be used to handle diversified top-k search by applying three new 
functions, namely, a sufficient stop condition sufficient(), a 
necessary stop condition necessary(), and a diversified top-k search 
algorithm div-search-current() that searches the current result set. 
We study all three functions in detail and give three algorithms for 
div-search-current(). We conducted extensive performance studies to 
evaluate our algorithms. 